Skip to content

Add coordinator module and corresponding unit test#1465

Open
NumberWan wants to merge 14 commits intovllm-project:mainfrom
NumberWan:omni-coordinator
Open

Add coordinator module and corresponding unit test#1465
NumberWan wants to merge 14 commits intovllm-project:mainfrom
NumberWan:omni-coordinator

Conversation

@NumberWan
Copy link

Part of #984 - adds OmniCoordinator module

Summary

Add OmniCoordinator module for distributed stage instance coordination.

Related

Part of #984 - this PR implements the coordinator module.

ADD

  • OmniCoordinator: receives instance events via ZMQ ROUTER, publishes via PUB
  • OmniCoordClientForStage: stage instances send registration/heartbeat
  • OmniCoordClientForHub: hub side caches instance list
  • LoadBalancer: RandomBalancer for instance selection
  • Unit tests for above components

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1dd914a75a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

… test

Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
Copy link
Contributor

@lishunyang12 lishunyang12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments. Main concern is the TOCTOU race in _handle_event — the check-then-act on self._instances without holding the lock can cause duplicate registrations under concurrent events. The rest are smaller cleanups (license headers, type annotations, hardcoded test ports).

…types, dynamic ports

Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
…types, dynamic ports

Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
…types, dynamic ports

Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
Copy link
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #1465 Review: Add coordinator module and corresponding unit test

Overview

This PR implements the OmniCoordinator module for distributed stage instance coordination (Major PR 1 from #984). The implementation closely follows the design document with a few important deviations.

Critical Issues: 4 found

1. Instance Registry Never Removes DOWN Instances (Design Deviation)

File: omni_coordinator.py:271-276

The design document states instances should be "removed from the internal registry" when DOWN, but the implementation only marks them as DOWN and keeps them forever:

def _remove_instance_locked(self, event: InstanceEvent) -> None:
    info.status = StageStatus.DOWN  # Only marks, never removes

Impact: Memory leak, unbounded registry growth over time.


2. ZMQ Context Singleton Causes Test/Production Issues

Files: omni_coordinator.py:51, omni_coord_client_for_stage.py:27, omni_coord_client_for_hub.py:28

Using zmq.Context.instance() shares a global context across all components:

self._ctx = zmq.Context.instance()  # Global singleton

Impact: Tests interfere with each other, closing one coordinator can break others, cannot run multiple coordinators.

Recommendation: Use per-coordinator context: self._ctx = zmq.Context()


3. No Connection State Machine / Reconnection Logic

Files: omni_coord_client_for_stage.py, omni_coord_client_for_hub.py

The design mentions reliability with retry mechanisms, but there's no reconnection logic if the ZMQ connection drops. Network blips will cause silent failures with no recovery path.

Recommendation: Add connection state machine with exponential backoff reconnection.


4. Scalability Limit Not Documented

The single-threaded coordinator has inherent scalability limits:

Instances Messages/sec Status
< 500 < 150 ✅ OK
500-2000 150-600 ⚠️ Caution
> 2000 > 600 🔴 Bottleneck risk

Recommendation: Add max_instances parameter or document expected limits.


Strengths

  • ✅ Good test coverage matching design spec
  • ✅ Message schemas match design document exactly
  • ✅ Clean separation of concerns (messages, coordinator, clients)
  • ✅ Thread-safe with proper locking
  • ✅ ZMQ best practices (RCVTIMEO, NOBLOCK)

Recommendation

Address issues #1 and #2 before merge. Issue #3 (reconnection) could be documented as Phase 2 work if needed.

Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
…ontexts

Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
… when disconnected

Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
… when disconnected

Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
Copy link
Contributor

@lishunyang12 lishunyang12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All threads resolved LGTM

@lishunyang12
Copy link
Contributor

All my previous comments are addressed — thanks for the quick fixes. I'll defer to @hsliuustc0106 on the remaining items from their review (instance removal, reconnection logic).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants